1 Introduction

I must have tried to start coding half a dozen times before I actually was able to retain anything for longer than a couple of hours. That’s because I was learning for the sake of learning and didn’t have much of a scaffolding or context in which to place this knowledge for it to make any sense. If I had a specific thing I wanted to solve or a target I was working towards (kind of like what we’re doing in this bootcamp), it probably would have been much easier. As such, I think it’s really important early on to code towards a goal. Luckily, R can be a pretty goal-focused tool. It’s a language built specifically for statistical analysis (though it has sprawled to be useful for nearly everything else, too). We can use it to answer questions like, Are there differences between Group A and Group B? or Does Variable A correlate with Variable B?. You’ve already gotten familiar with a terminal in Day 1, and you also saw some data wrangling techniques in Day 2. Though we could certainly do much of the same things in R, we’re going to heavily focus today on where R shines: conducting statistical analyses and visualizing that data. By the end of the day today, we’re specifically going to answer:

1.) Are there sex differences in which participants have complete MRI data?
2.) Do we find sex differences in gray matter volume?
3.) Do we find any association between age and gray matter volume?
4.) Does sleep quality predict gray matter volume?

We’re going to breeze over some of the statistical theory because we don’t have all the time in the world, but even without a solid statistical background, you should nonetheless be able to follow along in practice. This is not an exercise in statistical knowledge; it is an exercise in coding, but the content just happens to be statistical.

A fair warning
A fair warning

1.1 Opening A New Markdown File

First things first: if you haven’t already done so, open R Studio. Then, you’re going to want to open a new R Markdown document, by clicking File > New File > R Markdown…

How to load a new markdown file
How to load a new markdown file

This should produce a dialogue box where you can enter the name of the script and your name before selecting OK.

What to enter for a new markdown dialogue box
What to enter for a new markdown dialogue box

Then, I’d recommend clearing out all of the default text that appears in a new R Markdown document, which I have highlighted below:
We don’t need all this junk

1.2 An R Markdown Primer

Yesterday and the day before, we used a typical coding script, where every line only contained code that the language could interpret, even if what the code was interpretting was “ignore this line” as is the case with hash marks (#). In order to leave ourselves any notes, we had to use hash marks, and all of the output from our script either appeared in the terminal or in a separate window, as was the case for Spyder. A Markdown script does the same things as a typical coding script, but it gets a little bit fancier.

  1. If we want to leave notes, we don’t have to “comment it out”. We can just write long-winded narration that can help others understand why we coded what we coded and what that code does. That’s because…

  2. A typical script will interpret any text as a command, unless the text is otherwise marked by a hashtag (#). A markdown script only interprets things as code when we tell it to, and we tell it what is code by creating a chunk. Chunks are marked by three backticks (```) followed by a {r} and, on another line, three more backticks.
    An example of a chunk
    A typical script can’t make sense of this, though. We need to use markdown scripts to do it. You might be thinking, though, that manually denoting code from non-code seems like extra work, and it is a little bit, but it can also be a lot more convenient because …

  3. The output of any given chunk will appear just below that chunk, rather than in the R Studio Console Window or Spyder Console Window. By output, we just mean the product, sum, or status of whatever calculation or item you are asking R to compute and show you.
    An example of chunk output
    You can still check the console window if you like; the output will still be there, but you also have a record that centralizes all of the output into a single window and can be easy to follow along. That’s because…

  4. Markdown grants us greater control over what we see and when we see it. To demonstrate, let’s start by creating a new chunk in our markdown document and entering what we see in the image above, you can then follow along with the next bit:

2 + 4
## [1] 6

With a typical script, if we want to know the output of a line we ran awhile ago, we either have to rerun it or scroll through the console to find it. With Markdown we can minimize entire chunks and their output by using the minimization button [ Minimization Arrow ] on the left side of the window. If we want to hide output, we can use the expand/collapse button [ Minimize Command ] on the right side of the output window We can choose exactly what we want to run using the the “Run” command [ Run Command ] in the upper right corner of the chunk. If you want to export the output, whatever it may be, to a new window, simply click the first icon of the the three in the upper right corner of the output box [ Export Command ], and to delete, press the third [ Delete Command ]. Note that deleting the output does not affect the related code. Also of note, the down-facing arrow (second icon in the upper right corner of the code block) will tell R “Run all of the blocks of command that I have before this block” [ Run All Chunks Command ]. It can be helpful if you make a mistake and don’t want to manually rerun all of the previous blocks one by one to get back to where you were. It also makes your code very easy for other people to run. They can quite literally do it with the click of a button! If we click the cog icon in the same tray, we can access the output options and manipulate where output appears and what it looks like, but that’s beyond the scope of this review [ Settings Command ].

  1. Science is all about replicability and sharing information. As such, it’s super important that others can not only come to the same conclusions, but also undertstand how we came to those conclusions in the first place. Using an R Markdown script will allow us, once we are done, to compile all of our work, or “knit” it, into an interactive .pdf or .html document which others can follow. For an example, look no further than the .html files you all have been reading over the past couple of days! They are very helpful if you want to run a stat coding blog, document your analytic process for peer reviewers, or, say, teach a workshop on how to code in R.
    Now that we’ve got that out of the way, let’s get started!

1.3 What’s a “Package”?

Packages in R are synonymous with libraries in other languages. They are more or less convenient short-cuts or functions someone else already programmed to save us some work. Somebody else already figured out a very quick way to compute a regression so now we don’t have to! We just use their tools to do it.

Every new package is centralized in R’s repository, so even though thousands of people are working on these things independently, you don’t need to leave R to find them. Before they can be used, they must be installed, and you can do that pretty simply:

  install.packages("PACKAGENAME")

Hopefully you’ve already done this, though. If need to update a package, you can just rerun the above code. If you’re using R Studio, you can also see a list of your packages and their associated descriptions in the ‘Packages’ Tab of your Viewer Window.

Packages tab of viewer window where one can visualize previously installed packages
Packages tab of viewer window where one can visualize previously installed packages

Now we’ve installed a package, but that doesn’t mean we can use it yet. We need to tell R “We want access to the functions this package has during this session by calling it with the library() command.

  library(PACKAGENAME)

Notice that we drop the quotation marks now. We just specify the (case-sensitive) package name and it let’s R know we are planning on using that this session.

You might be wondering why we need to take this extra step. Sometimes different packages use the same commands, so having more than one of those active at the same time could confuse R (When this does happen, R will usually tell you). Sometimes packages take up a lot of disk space, so having ALL of your packages initialized at once might leave your computer running extremely slow. It’s the same for most languages.

If we ever want to explore the functions contained within a package in conjunction with examples, we can either go to the R documentation website or type ‘??PackageName’ into the Console, which will then populate the Help Tab of the Viewer Window with information on the package.

1.3.1 Exercise 1

Load the following packages:

naniar
report
skimr
stargazer
tidyverse

Don’t forget!: This is code, so if we’re using a markdown script, it’ll need to be entered into a Chunk.

'



'

Click for solution

1.4 Specifying Your Working Directory in R

Like any other language or program, R needs to be told where the data that we’d like to work with is located on our computer. It doesn’t just know automatically. We’ll probably have to tell it more than once where files are located, so we can create a new variable containing a filepath to make this process simple so we aren’t writing it out multiple times. I’m using a Windows computer, so nearly everything is contained within my C:/ Drive. If I were on a Mac, I’d start with a forwardslash. If I wasn’t sure of my path, R makes it relatively easy to find it. If I start by entering this:

# For Windows
Path <- "C:/"

# For Mac
Path <- "/"

I can press tab when my cursor is to the left of the slash to see a list of directories contained within my C:/ Drive. Here’s an example of what you should see:

An example of R’s Tab-Controlled Dropdown Menus
An example of R’s Tab-Controlled Dropdown Menus

Pressing tab again will enter into a directory, thus showing me the contents of that directory. From there, I can keep hitting tab until I get to the directory, or folder, that contains the files I want to work with. I can then save this filepath, which is just what we call a string (i.e., text containing not quantiative value), as an object named Path. We do so by placing the object on the left of an equal sign (=) or an arrow (<-) and the value that object is taking on the right side of it.

# For Windows
Path <- "C:/Users/Administrator/Documents/Github/intro-to-coding-2023/data/"

# For Mac
Path <- "/Users/Administrator/Documents/Github/intro-to-coding-2023/data/"

This format of assigning a value to an object is really important and we’ll keep coming back to it throughout this tutorial.

1.5 What is a “dataframe”?

Before we load in the data, I want to highlight a little terminology. The data that R works with is always contained within what we call a ‘dataframe’. You’ve likely heard this term in the past workshops. It’s quite literally what it sounds like: a framework in which to situate your data. It contains many cells that are situated into columns (which have names) and rows (which may or may not have names). Think of an excel sheet, except instead of an effectively infinite void of blank columns and rows which you could scroll or write into at your convenience, in R and similar languages, we need to clearly denote the boundaries (how many columns and rows) the data is contained within. We do this whenever we create a dataframe from scratch, or we read in pre-existing data. Because this workshop builds on the previous ones, we don’t need to do the former; we can just ‘read in’ pre-existing data.

1.6 How do I load data into R?

There are many ways to load data into R and they all depend upon what format the data is in. R can handle data from .csv, .xlsx, .txt, .html, .json, SPSS, Stata, SAS, among others. R also has it’s own data format (.RDA, .Rdata). With the exception of .RDA, .csv is often the cleanest means of reading in data. We won’t cover the other formats, but they are fairly exhaustively covered in this tutorial.

In the most basic sense, we can load data like this:

  setwd(Path)
  df <- read.csv(file = "r_data.csv")

The setwd() command accepts our Path variable and tells R where to look for our .csv file. The read.csv() command actually loads in the data. If done correctly, we should see our Environment populate with a dataframe labeled df.

A visualization of the Environment Window. Note that the number of observations and variables may be different from the dataframe you are currently reading in.
A visualization of the Environment Window. Note that the number of observations and variables may be different from the dataframe you are currently reading in.

If you click on df in the environment, it will open in a new tab of your Source Window (The same window you are likely writing script in) where you can view it. However, we can also look at the data in our markdown file though by entering the head() command from base R, which will show us the first few lines:

  head(df)

There’s a few things that read.csv() assumes correctly right off the bat. The default settings assume that the first row contains header names, which is correct in this case, so we don’t need to specify that. Behind the scenes, read.csv() assumes that the different columns are divided by commas, which is also true in this case. If it weren’t the data would look totally jumbled. However, it’s not able to tell that the first column is just index information. Right now, it’s treating it as if it’s regular, informative data like all the other rows. We can fix that with a simple command:

  setwd(Path)
  df <- read.csv(file = "r_data.csv",
                 row.names = 1)
  head(df)

Notice that these arguments are separated by commas. If we don’t put commas between arguments contained within the same command, R won’t run the command. Also notice that we can put each of these arguments on their own lines. This isn’t necessary, but it makes the code easier to read. R is NOT sensitive to indentation like Python is, so this is truly just a stylistic preference, but one I’d recommend adopting.

Running them on the same line yields the same output:

  setwd(Path)
  df <- read.csv(file = "r_data.csv", row.names = 1)
  head(df)

Digressions aside: Amazing! Now we have six columns of data, like we should. We might also notice that the first row of column GM_Volume, which is blank or missing data, was correctly identifed as NA or Not Available. This is very important because R treats NA values as categorically distinct from other types of value, and we want those to be correctly identified, so we don’t need to worry about that either. However, there’s just one more thing we can specify that will save us a headache later. You might notice that Age is classified as a “Character” type, however, we want it classified as a “Factor”, which would suggest that it contains non-numeric values that can be organized in a logical order. The *stringsAsFactors argument can take care of that.

  setwd(Path)
  df <- read.csv(file = "r_data.csv", 
                 row.names = 1,
                 stringsAsFactors = TRUE)
  head(df)

Although we’re only using a few arguments, we could customize this command much more by adding more arguments. If you ever want to see a list of arguments that a command can understand, you can either:

1.) Type the command name into the Console or Source Window script, press Tab within the parentheses and
use the arrow keys to navigate the subsequent dropdown menu that appears.
2.) Enter ?CommandName in the Console and read documentation on it in the Viewer Window.

Option 1 is more convenient and probably a better choice for a command you are already familiar with. Option 2 is more exhaustive and great for a command you are still trying to understand.

A visualization of what you will see if you press Tab to view valid arguments for a given command. Notice that hovering over an argument results in a pop out window which briefly describes the argument.
A visualization of what you will see if you press Tab to view valid arguments for a given command. Notice that hovering over an argument results in a pop out window which briefly describes the argument.

1.6.1 Exercise 2

read.csv() contains a logical argument, which “if TRUE will check the names of the variables in the data frame to ensure that they are syntactically valid variable names. If necessary they are adjusted so that they are, and also to ensure that there are no duplicates.” Using one of the two methods highlighted above, find what that argument is and rewrite the data-reading command with that argument included.

'



'

Click for solution

1.7 Subsetting Data

We already saw that we aren’t starting with a perfect dataset: some of the data is missing! Sometimes, this isn’t such a bad thing, but we really need to know what data is missing and how it affects our analyses. The gg_miss_var() function from the naniar package tells us that information:

gg_miss_var(df)

In our case, we can see the only missing data that we seem to have is gray matter volume for eight participants, which isn’t bad. If we wanted to analyze our data, though, we might want to create a subset of it, or a dataframe excluding this participants. There’s a million ways to do this in R. We’re going to start with the easiest to follow using the subset() function though:

df_sub <- subset(df,
                 !is.na(df$GM_Volume))

This looks kind of similar to write.csv(). We have our command on the outside of the parentheses. We have the object that we’re targeting (df). We have it followed by a comma. Our second argument is where things get a little tricky, so let’s break it down, but let’s hope around a little bit. First, note that we see our dataframe name followed by a dollar sign followed by the column name that we know has missing data. If we want to call or target a specific column in R, this is how we do it. That object (df$GM_Volume) is the only argument in another command, is.na(). What is.na() does is identify whether or not a given value has a value of NA (remember, I said R treats NA values as categorically distinct!).

So, it will look through the whole column df$GM_Volume and identify which values are NAs and which aren’t. Placed where it is, it would tell R “Keep all the rows that have an NA in this column, and throw out all the rest”. However, there’s a little catch because of the bang (!) placed before it: this actually tells R “Do the opposite of what follows”.

All together this command tells R to look through the GM_Volume column in the df data frame, get rid of any rows where the value of GM_Volume is NA, and then save that resulting dataframe as an object called df_sub.

We can do this more concisely, though:

df_sub <- df[!is.na(df$GM_Volume),]

It looks different, but it does the same exact thing. Let me take a quick step back and talk about indexing in R, because that’s what allows us to do that above. I already mentioned if we want to call a specific column in R, we can index it using it’s name:

df$GM_Volume
##   [1]     NA 807245 664124 726206 762308 579632 665024     NA 707674 773472
##  [11]     NA 676282 717215 784458 690232 780767 772912 795282 609326 726045
##  [21] 771666 734371 628040 666937 634289 656325     NA 653952 580254 688424
##  [31] 701888 841393 664695 680517 668175     NA 681150 725170 813927 632781
##  [41]     NA 706819 612575 704006     NA 631207 759308 566888 713350 683011
##  [51] 675979 706210 749087 780242 545398 587201 668351 766961 755903 667947
##  [61] 582101 662773 745611 634207 731220 722399 597704 606040 818733 771253
##  [71] 753943 622474 728529 611182 744203 641917 644684 623000 769946 670102
##  [81] 669371 690017 631035 716081 589739 704146 690289 593625 756134     NA
##  [91] 779307 689932 673054 691945 624339 615245 708725 619845 644715 509704

If I want to see a specific row, let’s say Row 2, within that column, I can add brackets and the number of that row behind it:

df$GM_Volume[2]
## [1] 807245

However, we can also index the column using it’s relative position. Knowing that the GM_Volume column is the fifth column, I can do this:

df[,5]
##   [1]     NA 807245 664124 726206 762308 579632 665024     NA 707674 773472
##  [11]     NA 676282 717215 784458 690232 780767 772912 795282 609326 726045
##  [21] 771666 734371 628040 666937 634289 656325     NA 653952 580254 688424
##  [31] 701888 841393 664695 680517 668175     NA 681150 725170 813927 632781
##  [41]     NA 706819 612575 704006     NA 631207 759308 566888 713350 683011
##  [51] 675979 706210 749087 780242 545398 587201 668351 766961 755903 667947
##  [61] 582101 662773 745611 634207 731220 722399 597704 606040 818733 771253
##  [71] 753943 622474 728529 611182 744203 641917 644684 623000 769946 670102
##  [81] 669371 690017 631035 716081 589739 704146 690289 593625 756134     NA
##  [91] 779307 689932 673054 691945 624339 615245 708725 619845 644715 509704

When indexing a data frame using brackets, the first position refers to the row or rows and the second position refers to the column or columns. By leaving the first position blank, we’re telling R “Show me all of the rows for column 5”.

If I wanted to only see the value of the 2nd row for that column, I could do this:

df[2,5]
## [1] 807245

So, going back to our new subset technique:

df_sub <- df[!is.na(df$GM_Volume),]

What we’re telling R by including that !is.na(df$GM_Volume) command in the first index position is for R to only include rows that meet this condition, and we just went over that condition above. By leaving the second position blank, after the comma, we’re telling R “include all columns, though”.

1.8 Conditional Statements

Let’s say we wanted to only look at female participants, though. It’s pretty simple to do. We’d just construct a conditional statement. !is.na(df$GM_Volume) is a conditional statement, but, like I said, R handles NA’s categorically different from other values, so NA conditional statements look a little different than other conditional statements. Other conditional statements look pretty similarly to how they do in other languages in R.

df$X3T_Full_MR_Compl == FALSE

Notice the two equals signs (==). When two value operators (=, >, <, !) are placed next to each other in R, and many other languages, we aren’t assigning a value to an object; we are comparing the values between two different objects. In this instance, using two equals signs, if the two values are equal, it would produce a TRUE value; if not, then a FALSE. This variable which can only take the value of either True or False is called a boolean. When we tell R to compare the value on the right with this specific column, what it is mechanically doing is iterating through each row within this column, comparing the column present, and noting whether the conditional is True or False. In this way, it’s very similar to For Loops in other languages.

So, if I wanted to know which participants had incomplete MRI data, I could enter this:

df$X3T_Full_MR_Compl == FALSE
##   [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
##  [13]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [37] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
##  [73] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE
##  [85] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
##  [97]  TRUE FALSE  TRUE FALSE

It produces an array of TRUEs and FALSEs of the same length as the rows in the dataframe, because each individual TRUE and FALSE is telling us whether each row in that column meets the condition we defined. If we see a TRUE in the first position, we know that the first participant does not have complete MRI data. If we see a FALSE in the second position, we know that the second participant does have complete MRI data. Again, !is.na(df$GM_Volume) is doing the same thing; just answering a different question. If we run it alone, the output looks really similar:

!is.na(df$GM_Volume)
##   [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
##  [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [25]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [37]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
##  [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [85]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [97]  TRUE  TRUE  TRUE  TRUE

So, we could theoretically plug just about any conditional statement in our subset approaches and subset the data as we wish:

# The nrow() command outputs how many rows the data frame has
# We're doing this to show that both approaches yield the same result

# Approach 1
df_compl <- df[df$X3T_Full_MR_Compl == FALSE,]
nrow(df_compl)
## [1] 25
# Approach 2
df_female <- subset(df,
                 df$X3T_Full_MR_Compl == FALSE)
nrow(df_compl)
## [1] 25

1.8.1 Exercise 3

Let’s create a subset of the data that only contains female participants:

df_female <- df[df$Gender == F,]
nrow(df_female)
## [1] 0

Oh no! What happened? Figure out the mistake and recode the command correctly.

'



'

Click for solution

1.9 Slight Break Time

That was a big lead up to the main event, but basically familiarity with R syntax and conventions is really important before we start conducting analyses in R. Hopefully you feel like you have a half-decent grasp on what the environment contains, how to use the console to enter commands, and where to look when you need help. We’re now going to get into analyses. We’re going to start relatively simple with analytic approaches that utilize qualitative, categorical variables as predictors and get progressively more advanced to analyses that use continuous, numeric, quantitative predictors. At this juncture, let me reward my captive audience with a cute seasonally-appropriate photo of my dog.

Sadie in a Christmas Sweater, circa 2020.
Sadie in a Christmas Sweater, circa 2020.

2 Analyzing Data w/ Qualitative Independent Variables

To start, we’re going to be looking at how to analyze data that includes categorical or qualitative variables. Hopefully, hearing terms like “Categorical” and “Qualitative” immediately brings back to memory a term I mentioned earlier: factor. As a reminder, factors are qualitative groupings which can often be ordered in a logical way. We’re going to be showing you how to code a Chi Squares, which has a categorical variable for both its dependent and independent variables. We’ll then examine how to code T-Tests, which look for statistically significant differences between two categorical groups in a quantiative outcome. Lastly, we’ll look at ANOVAs, which allow us to examine more than one group simultaneously. We’ll work in a touch of data visualization as well using ggplot2. You don’t need to know much about those tests; I’ll try to explain enough for you to keep up, but if you want to read more statistical theory, click each of their names to follow the hyperlinks.

2.1 Chi-Square Tests

A Chi-Square Test can be used when both the predictor and outcome, or independent and dependent variables, respectively, are categorical. They involve checking if the frequencies in our outcome or dependent categories match what we would expect to see if there were no differences between our predictor or independent categories. For lack of a more interesting association, within the context of our data, we could ask a question like whether There are differences between males and females in MRI data completion. Though perhaps not inherently interesting, it would be really important to know if some underlying component of our study design was disproportionately affecting one sex’s completion rate.

QUESTION: Do males and females differ in the rate of MRI data completion?

HYPOTHESIS: I have no reason to expect differences so we’ll hypothesize that there will be no statistically significant differences between the two.

RELEVANT VARIABLES: Dependent: df$X3T_Full_MR_Compl (Factor) Independent: Gender (Factor)

ANALYSIS: Chi-Square

There’s a whole ton of assumptions and quality concerns we usually want to test before we run a test, which we covered in the last intro to coding bootcamp we ran, but we learned the hardway that we really don’t have the time to review all this background information in just one day’s review, so I’m going to skip that for now, but you can always use the hyperlink above if you want to look at it on your own time.

The actual statistical test coding is fairly simple. We note our x variable and our y variable and then run the command. We are saving it as an object chi. NOTE: I have an extra set of parentheses around the whole statement. This tells R “When you’re done running this, print the output”. I hadn’t explicitly mentioned it previously, but if all you’re doing is assigning a value to a specific object, R won’t print what that object’s value is unless you explicitly tell it to. Here, because of the parentheses, I’m explicitly telling it to.

  (chi <- chisq.test(x = df$Gender, 
                     y = df$X3T_Full_MR_Compl))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  df$Gender and df$X3T_Full_MR_Compl
## X-squared = 0.21471, df = 1, p-value = 0.6431

But because I’ve both asked it to print the result and saved it as an object, I can also call the exact same object later and see the results again.

  chi
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  df$Gender and df$X3T_Full_MR_Compl
## X-squared = 0.21471, df = 1, p-value = 0.6431

On the first line, we see the two variables we were comparing named. On the second line, we see our test statistic (X^2 value), our degrees of freedom, and our p-value. Because our p-value is greater than 0.05, we’re going to say we do not find any statistically significant differences between males and females, which is great. Again, we ideally don’t want patterns behind our missing or incomplete data.

This is by no means, necessary, but I always like to write my journal-ready summary statement within my code so I can quickly reference it in the future and remember how I had interpreted the results in the past.

A Pearson's Chi Squared test with Yates' continuity correction determined that males and females did not significantly differ in the frequency with which they failed to provide complete MRI data [x-squared = 0.215, df = 1, p > 0.05].

2.1.1 Exercise 4

Let’s run a second Chi Square Test, this time examining whether there were differences in the number of males and females between the ages of 22 and 30. This is going to be a challenge and there is more than one way to do it. Remember, age is also considered a factor. Don’t be afraid to use Google.

'



'

Click for solution

2.1.2 Visualizing Chi Squares

While a Chi Square Test is great, verbalizing the relationship can only go so far to communicate how these variables are related. Where R really excels, relative to some other languages, is visualizing relationships. ggplot2 is one powerful tool to build visually-pleasing graphics one layer at a time.

Behold, the glory of GGPlot
Behold, the glory of GGPlot

This chunk below should probably look big and confusing, and that’s fine for now. We’re going to run it so we can see what the end product looks like. Then we’re going to walk through these lines and I’ll show you that it really looks much more intimidating than it is.

   ggplot(data = df, aes(x = Gender, fill = X3T_Full_MR_Compl)) +
          geom_bar() +
          scale_x_discrete(labels = c("Female", "Male")) +
          scale_fill_discrete("Incomplete MRI Data", labels = c("Incomplete", "Complete")) +
          labs(title = "Incomplete MRI Data by Gender",
               x = "Gender",
               y ="Frequency") +
          theme_classic() 

As I’d mentioned, ggplot’s are built in layers. The first layer consists strictly of the plane in which the data will eventually be plotted. By entering where the data is being pulled from and the x and y values, we can lay the essential dimensions, but it’s still not something very useful.

   ggplot(data = df, aes(x = Gender, fill = X3T_Full_MR_Compl)) 

We can add new layers by adding a plus sign (+) to the end of the previous line. Think of this as similar to a comma; we don’t necessarily have to move the next layer object to a new line, but it helps for readability. The next layer object is the geom. Geoms can be lines, bars, circles or whatever other shape you might imagine. They tell ggplot how the data should be plotted. In this case, we want a bar. By simply adding geom_bar(), we see some massive changes to the plot. It’s much more readable!

   ggplot(data = df, aes(x = Gender, fill = X3T_Full_MR_Compl)) +
          geom_bar() 

However, the axis labels aren’t very descriptive. There’s no title. Someone who just happened upon this visualization wouldn’t be able to make much sense of it. We can fix this by labeling the different scales. We’ll start with the X axis. Because our variables aren’t continuous, but discrete, we’ll use the scale_x_discrete command. We’re going to concatenate the labels we want to see in their respective order (notice above female is to the left and male to the right).

   ggplot(data = df, aes(x = Gender, fill = X3T_Full_MR_Compl)) +
          geom_bar() +
          scale_x_discrete(labels = c("Female", "Male"))

Next, we can make our fill variable, X3T_Full_MR_Compl, make a little more sense with basically the same argument. Notice the general trend in the commands is “scale_VARIABLENAME_VARIABLETYPE”. So if we had a quantitative variable within ggplot(data = df, aes() called color =, we would use scale_color_continuous() to modify it.

   ggplot(data = df, aes(x = Gender, fill = X3T_Full_MR_Compl)) +
          geom_bar() +
          scale_x_discrete(labels = c("Female", "Male")) +
          scale_fill_discrete("Incomplete MRI Data", labels = c("Incomplete", "Complete"))

Amazing! Now in layer 5, we are just adding more labels. “Frequency” describes the y axis better than count and we might want a label on the plot.

   ggplot(data = df, aes(x = Gender, fill = X3T_Full_MR_Compl)) +
          geom_bar() +
          scale_x_discrete(labels = c("Female", "Male")) +
          scale_fill_discrete("Incomplete MRI Data", labels = c("Incomplete", "Complete")) +
          labs(title = "Incomplete MRI Data by Gender",
               x = "Gender",
               y ="Frequency")

Lastly, it might be nice to get rid of that horrid gray background. This is where the family of theme() commands come in. We could apply a nice general pre-packaged theme like so:

   ggplot(data = df, aes(x = Gender, fill = X3T_Full_MR_Compl)) +
          geom_bar() +
          scale_x_discrete(labels = c("Female", "Male")) +
          scale_fill_discrete("Incomplete MRI Data", labels = c("Incomplete", "Complete")) +
          labs(title = "Incomplete MRI Data by Gender",
               x = "Gender",
               y ="Frequency") +
          theme_classic()

… or this …

   ggplot(data = df, aes(x = Gender, fill = X3T_Full_MR_Compl)) +
          geom_bar() +
          scale_x_discrete(labels = c("Female", "Male")) +
          scale_fill_discrete("Incomplete MRI Data", labels = c("Incomplete", "Complete")) +
          labs(title = "Incomplete MRI Data by Gender",
               x = "Gender",
               y ="Frequency") +
          theme_minimal()

… or we can get really crazy and manually specify the size, position, font, and face of just about everything in the plot.

   ggplot(data = df, aes(x = Gender, fill = X3T_Full_MR_Compl)) +
          geom_bar() +
          scale_x_discrete(labels = c("Female", "Male")) +
          scale_fill_discrete("Incomplete MRI Data", labels = c("Incomplete", "Complete")) +
          labs(title = "Incomplete MRI Data by Gender",
               x = "Gender",
               y ="Frequency") +
          theme_classic() +
          theme(plot.title = element_text(face="bold", size=13, hjust = 0.5)) +
          theme(plot.subtitle = element_text(face = "italic", size = 10, hjust = 0.5)) +
          theme(plot.caption = element_text(face = "italic", size = 8, hjust = 0.0)) +
          theme(axis.title = element_text(size = 12)) +
          theme(axis.text.x = element_text(size = 14, color = "Black")) +
          theme(axis.text.y = element_text(size = 14, color = "Black"))

These same elements are going to be used over and over again no matter what type of data you are plotting. COG has dedicated a few workshops to visualizing data in R, including the basic setup, and more advanced visualizations that animate, react, and become interactive, so if you’re really interested in learning more, I’d recommend looking at the resources from our 2021 Workshop and our 2022 Workshop.

2.1.3 Exercise 5

Let’s try to generate a plot for the test we can in exercise 6. Remember, we want to see Gender on the x axis and the frequency of the age groups on the Y axis. There are also at least two ways to go about this.

'



'

Click for solution

2.2 T-Tests

A T-Test can be used when both the predictor variable consists of two categorical options and the outcome or dependent variable is numeric in value. A T-Test tells you how significant the differences between these categories or groups are. In other words, it lets you know if the differences between the means of two groups could have observed by chance. We could imagine a situation where an evil teacher told half of the class before a test the right chapter to study from and told the other half of the class the wrong chapter to study from. The two categories or groups might be Right Chapter and Wrong Chapter and the outcome variable would be Test Score. Using a T-Test, we could determine whether studying from the right materials produces higher test scores. Within the context of our data, we could ask a question like whether males and females demonstrate significant differences in the volume of gray matter they possess.

QUESTION: Do males and females differ significantly in gray matter volume?

HYPOTHESIS: I’m an absolute idiot when it comes to structural neuroscience knowledge so I don’t really know what we might expect to find. A real pea-brain hypothesis might be “On average, men will have greater gray matter volume, simply because they likely have larger brains to fit in their larger heads”.

RELEVANT VARIABLES: Dependent: GM_Volume (numeric) Independent: Gender (Factor)

ANALYSIS: Two-Sample T-Test

We are again going to skip over a lot of assumption testing, but I do want to take a moment to show off one of the most helpful functions I’ve found for this purpose. The skim() function from the skimr package gives us a lot of information in small amount of space. We could run it on our entire dataset, or just a few variables. Since we don’t have that much, let’s run the whole thing and look at the output.

skim(df$GM_Volume)
Data summary
Name df$GM_Volume
Number of rows 100
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 8 0.92 687194.2 68400.49 509704 633850.5 685717.5 736829 841393 ▁▆▇▆▂

It looks far less impressive than it typically does with so few variables, but we quickly get to see the structure of all of our data, the number of missing values for each variable, the mean value, a small histogram of the distribution of that variable, the interquartile ranges, frequency of factors, etc. It’ll often replace several other functions for me. Again, though, because this isn’t a statistics course, I don’t want to get too into the weeds.

2.2.1 Okay, Now T-Tests

Okay, assuming that we’ve met our assumptions and decided that a t-test is an appropriate means of measuring our question, let’s run the actual t-test. It’s a little more complicated than our initial chi-square test, but maybe simpler than the chi-square exercise. We need to use conditional statements again to specify our variables. What we are comparing here are the mean values of gray matter volume for males and females. As such, we are going to specify we want to see gray matter when Gender == M and when Gender == F. We next have an argument which asks us whether this study is a within-subjects or a between-subjects design. This question is between-subjects, since each participant is either male or female for this study, so we mark that as FALSE. Lastly, R is asking us to define our alternative hypothesis, which is a little beyond the scope of this review, so you will have to take my word that “two.sided” is the right call. Lastly, when we look at T-Tests, standard deviations are very important, but the t.test() function won’t automatically generate those. We are using the sd() function to capture the standard deviation of reaction action, and we’re adding the argument (na.rm) that tells R to ignore any row that has a value of N/A.

  (ttest<- t.test(x = df$GM_Volume[df$Gender == "M"],
                 y = df$GM_Volume[df$Gender == "F"],
                 paired = FALSE,
                 alternative = "two.sided"))
## 
##  Welch Two Sample t-test
## 
## data:  df$GM_Volume[df$Gender == "M"] and df$GM_Volume[df$Gender == "F"]
## t = 9.1271, df = 89.348, p-value = 1.985e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   73662.87 114658.20
## sample estimates:
## mean of x mean of y 
##  739391.9  645231.4
  sd(df$GM_Volume, na.rm = T)
## [1] 68400.49

My pea-brain hypothesis panned out! We do see statistically significant differences, judging by that infinitely small p-value (t = 9.1271, df = 89.348, p-value = 1.985e-14). Now, like last time, we could write these results ourselves, but let’s take a moment to pull out a nifty little tool from the report package (that unfortunately doesn’t work with chi-squares or we would have used it earlier). report() will save us the trouble and summarize the results of our test (albeit a little imperfectly) for us.

report(ttest)
## Warning: Unable to retrieve data from htest object.
##   Returning an approximate effect size using t_to_d().
## Effect sizes were labelled following Cohen's (1988) recommendations.
## 
## The Welch Two Sample t-test testing the difference between
## df$GM_Volume[df$Gender == "M"] and df$GM_Volume[df$Gender == "F"] (mean of x =
## 7.39e+05, mean of y = 6.45e+05) suggests that the effect is positive,
## statistically significant, and large (difference = 94160.53, 95% CI [73662.87,
## 1.15e+05], t(89.35) = 9.13, p < .001; Cohen's d = 1.93, 95% CI [1.43, 2.43])

Again, not necessarily formatted in a way that’s ready for publication, but it automatically compiles a lot of the relevant statistics in a way that saves you a little bit of coding and effort.

2.2.1.1 Exercise 6

I’m curious whether there are differences in gray matter between our second oldest and youngest age groups as well (I’d normally do oldest and youngest, but we don’t have enough observations from 36+ people to run the test!). Run another t-test to test this out and try to interpret the output yourself. You can run report() to try to see whether your interpretation matches.

'



'

Click for solution

2.3 ANOVAs

An ANOVA, or Analysis of Variance, can be used when both the predictor variable or variables consist of two or more categorical options and the outcome or dependent variable is numeric in value. Much like a T-Test, ANOVA tells you how significant the differences between these categories or groups are. The advantage over T-tests is that we can compare multiple groups or categories in one analysis. We could revisit our last horrible example and imagine that the evil teacher tells one group right chapter to study from, one group the wrong chapter to study from, and one group to not study at all. An ANOVA test will tell us whether any of these three groups are different from one another (but not necessarily which specific groups are different from one another). Within the context of our data, we could use this type of test to determine whether there are any age related differences in gray matter volume (spoiler alert: We they were only trending statistically significance in the last exercise).

QUESTION: Are there differences in gray matter volume by age group?

HYPOTHESIS: Differences will exist in gray matter volume by age group.

RELEVANT VARIABLES: Dependent: GM_Volume (numeric) Independent: Age (Factor)

ANALYSIS: ANOVA

Pay close attention to the formatting of the syntax here. It is the standard way in which we specify most statistical models in R, whether for regression, ANOVA, hierarchical modeling etc.

  aov <- aov(GM_Volume ~ Age, data = df)

The Base R command we use to specify that this is an ANOVA model is aov(). We then place our outcome/criterion/dependent variable next, followed by a tilde (~). Following the tilde comes the predictor/independent variable(s). Once the model has been specified, we note its end with a comma and tell the model where the data exists. Now if we run this model, we’ll see a new object by the name aov. But when we call that object…

  aov
## Call:
##    aov(formula = GM_Volume ~ Age, data = df)
## 
## Terms:
##                          Age    Residuals
## Sum of Squares   32640421161 393114596276
## Deg. of Freedom            3           88
## 
## Residual standard error: 66837.2
## Estimated effects may be unbalanced
## 8 observations deleted due to missingness

… the information isn’t really formatted in a way that’s immediately meaningful or understandable. We typically look towards metrics like p values or means to understand ANOVA results and those are not present here. In order to see those, we need to summarize the ANOVA object.

  summary(aov)
##             Df    Sum Sq   Mean Sq F value Pr(>F)  
## Age          3 3.264e+10 1.088e+10   2.436 0.0701 .
## Residuals   88 3.931e+11 4.467e+09                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 8 observations deleted due to missingness

Now we can see the variable we were exploring to the left with it’s related degrees of freedom, sum of squares, mean square value, F vlaue, and p value to its right, respectively. The summary function uses the significance code at the bottom, so we see we again only find a relationship that trends significance.

The report function works equally well on ANOVA objects as well.

  report(aov)
## The ANOVA (formula: GM_Volume ~ Age) suggests that:
## 
##   - The main effect of Age is statistically not significant and medium (F(3, 88)
## = 2.44, p = 0.070; Eta2 = 0.08, 95% CI [0.00, 1.00])
## 
## Effect sizes were labelled following Field's (2013) recommendations.

We can get a little crazier with our models though. What if we wanted to explore the main effects of age and sex on gray matter volume? We could respecify our model like this:

  summary(aov(GM_Volume ~ Age + Gender, data = df))
##             Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Age          3 3.264e+10 1.088e+10   4.639  0.00468 ** 
## Gender       1 1.891e+11 1.891e+11  80.605 4.98e-14 ***
## Residuals   87 2.041e+11 2.345e+09                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 8 observations deleted due to missingness

All of a sudden, when we account for differences in gender, we begin to see a more pronounced main effect of Age! That’s pretty interesting! Of course, we don’t know which age groups are significantly different from one another, for that we need to code contrasts but that is well beyond the scope of this tutorial (maybe one day in the future; R currently doesn’t have any good Bonferonni Contrast packages, as far as I know, so we’re working on one). For now, though, let’s talk about…

2.3.1 Visualizing T-Tests and ANOVAs

At it’s core, it’s going to be pretty similar to Chi-Square tests. Let’s just visualize the Age GM_Volume ANOVA results using a violin plot, which tells us what the distribution of data looks like, as well as the mean value. Notice, the only non-cosmetic (i.e., not title or variable) thing we’re changing is the geom: we did away with the geom_bar and replace it with a geom_violin (we’re also adding a trim argument to prevent it from extrapolating observations). To visualize the mean, we add a stat_summary command.

ggplot(df, aes(x = Age, y = GM_Volume)) +
    geom_violin(trim = TRUE) + 
    stat_summary(geom="point", fun = mean) +
    scale_x_discrete(labels = c("22-25 yrs", "26-30 yrs", "31-35 yrs", "36+ yrs")) +
    labs(title = "Differences in Gray Matter Volume by Age",
       subtitle = "",
       x ="Age Groups", 
       y ="Gray Matter Volume (units)") +
    theme_classic()
## Warning: Removed 8 rows containing non-finite values (`stat_ydensity()`).
## Warning: Groups with fewer than two data points have been dropped.
## Warning: Removed 8 rows containing non-finite values (`stat_summary()`).

2.3.2 Exercise 7

Notice the order of our commands above: what do you think would happen if you put stat_summary before geom_violin? Make a prediction, then try it out and see. When you finish that, make a visualization for our t-test (Gender differences in gray matter volume).

Click for solution

3 Analyzing Data w/ Quantitative Independent Variables

We are nearly done for the day, but before we close up shop, we need to cover analyses that use quantitative, numeric predictors, probably most common of which is linear regression. Regression can come in many flavors, including bivariate linear regression, multivariate linear regression, and binary logistic regression. Once again, I won’t get too much into the theory, but hyperlinks are provided for those who want it. We will demonstrate some useful tools and syntax to get you prepared to use R on your own.

3.1 Bivariate Regression

A bivariate linear regression can be used when both the predictor variable (X) and outcome variable(Y) consist of continuous numeric values. A linear regression tells us how predictive of Y that X is. In other words, if we measured temperature and ice cream sales, we might find, using linear regression that as temperature increases, we could predict with decent accuracy that ice cream sales would increase as well, and we could predict how many ice cream sales we expect to see for any one value of temperature. Within the context of our data, we could ask a question like whether one’s sleep quality affects their gray matter volume.

QUESTION: Does sleep quality predict gray matter volume?

HYPOTHESIS: As sleep quality increases, gray matter volume will increase.

RELEVANT VARIABLES: Dependent: GM_Volume (numeric) Independent: PQSI_score (numeric)

ANALYSIS: Bivariate Linear Regression

You’re going to notice right off the bat that the structure of the syntax here looks awfully similar to what we just did in ANOVA. We start by noting our method with the lm() function (Linear Modeling). We then note our outcome variable, add a tilde (~), note our predictor(s), and finally note our datasource.

m1 <- lm(GM_Volume ~ PQSI_score, data = df)

Just like ANOVA, we need to use a summary() function to read the data.

summary(m1)
## 
## Call:
## lm(formula = GM_Volume ~ PQSI_score, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -184261  -49863    4589   49437  142618 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   705991      16223  43.519   <2e-16 ***
## PQSI_score     -2405       1866  -1.289    0.201    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68150 on 90 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.01812,    Adjusted R-squared:  0.007213 
## F-statistic: 1.661 on 1 and 90 DF,  p-value: 0.2007

I find the output of summary to be easier to interpret for linear models than I do for ANOVAs, but it could still be cleaned up a little further using the stargazer package. The stargazer() function provides a much more audience-friendly way of sharing regression output that’s easily exportable.

stargazer(m1, 
          type = "text", 
          title= "Sleep Quality Predicting Gray Matter Volume",
          column.labels = c("Model 1"),
          dep.var.labels = "Gray Matter Volume", 
          star.cutoffs = c(0.05, 0.01,0.001),
          covariate.labels = c("Sleep Quality (PQSI)", "Intercept"),
          notes = "",
          out = paste0(Path,"RegressionTable.html"))
## 
## Sleep Quality Predicting Gray Matter Volume
## ==================================================
##                           Dependent variable:     
##                      -----------------------------
##                           Gray Matter Volume      
##                                 Model 1           
## --------------------------------------------------
## Sleep Quality (PQSI)          -2,405.127          
##                               (1,866.089)         
##                                                   
## Intercept                   705,990.800***        
##                              (16,222.750)         
##                                                   
## --------------------------------------------------
## Observations                      92              
## R2                               0.018            
## Adjusted R2                      0.007            
## Residual Std. Error      68,153.350 (df = 90)     
## F Statistic               1.661 (df = 1; 90)      
## ==================================================
## Note:                *p<0.05; **p<0.01; ***p<0.001
## 

We see that for every 1 unit increase in sleep quality, we can expect a 2,405.127 unit decrease in gray matter volume. However, because the standard error is so larger (1866.089), we cannot say that this relationship is statistically significant. Let me take a quick step back though and outline what I just did.

We start by specifying our model. If we have more than one, we can simply add them one after the other…

stargazer(m1, m2, m3, 
          type = "text", 
          ...

Type will modify how the table is constructed (I almost never have a reason to change this). We then can note various titles and labels and finally, where we want to export the table to. This of course is entirely optional; you do not need to export anything, but for the point of demonstration, let’s try exporting it to our Path directory. To make it simple, I’m using the paste0() command which will concatenate multiple strings into a single string with no space between them (contrasted to the paste() function which allows you to use separators like underscores or hyphens between the strings). If we type that same command into our console, you’ll see example what I mean.

paste0(Path,"RegressionTable.html")
## [1] "C:/Users/Administrator/Documents/Github/intro-to-coding-2023/data/RegressionTable.html"

Now, outside of R, let’s navigate to our respective directories and check that the regression table saved. Once you locate it and open the .html document, it should look like this.

Regression Table output
Regression Table output

Lastly, we can run the report() function for linear models as well!

report(m1)
## We fitted a linear model (estimated using OLS) to predict GM_Volume with
## PQSI_score (formula: GM_Volume ~ PQSI_score). The model explains a
## statistically not significant and very weak proportion of variance (R2 = 0.02,
## F(1, 90) = 1.66, p = 0.201, adj. R2 = 7.21e-03). The model's intercept,
## corresponding to PQSI_score = 0, is at 7.06e+05 (95% CI [6.74e+05, 7.38e+05],
## t(90) = 43.52, p < .001). Within this model:
## 
##   - The effect of PQSI score is statistically non-significant and negative (beta
## = -2405.13, 95% CI [-6112.44, 1302.18], t(90) = -1.29, p = 0.201; Std. beta =
## -0.13, 95% CI [-0.34, 0.07])
## 
## Standardized parameters were obtained by fitting the model on a standardized
## version of the dataset. 95% Confidence Intervals (CIs) and p-values were
## computed using a Wald t-distribution approximation.

3.1.1 Exercise 8

Let’s throw another big challenge at you all. We have not gone over creating visualizations for regressions, but, as I mentioned, a lot of ggplot syntax is the same, no matter what type of plot you want to create. In many ways, linear models are simpler to craft in ggplot. Try to visualize the model we just created. Make sure to talk to others and use all of the resources at your disposal if you get stuck (Google is your friend!). Make sure to save it as an object named “plot”.

'



'

Click for solution

3.2 Multivariate Regression

A multivariate linear regression builds upon bivariate regression by allowing for multiple predictors. If we measured temperature, ice cream sales, and the time since someone last ate, we might find that our previously specified model is now even more accurate because of the addition of the “time since last ate” variable. Within the context of our data, let’s reexamine how the sleep quality of men and women affect gray matter volume.

QUESTION: Does the interaction of sex and sleep quality predict gray matter volume?

HYPOTHESIS: Another pea-brain hypothesis: men have more to gray matter volume to lose, so I predict sleep quality will disproportionately affect their gray matter volume more than women.

RELEVANT VARIABLES: Dependent: GM_Volume (numeric) Independent: PQSI_score (numeric) Independent: Gender (factor)

ANALYSIS: Multivariate Linear Regression

Once again, extremely similar to what we’ve already seen

  m2 <- lm(GM_Volume ~ Gender * PQSI_score, data = df)
  summary(m2)
## 
## Call:
## lm(formula = GM_Volume ~ Gender * PQSI_score, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -143593  -34468    1789   36339  117956 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        668309.0    15939.7  41.927  < 2e-16 ***
## GenderM             92884.6    23545.2   3.945  0.00016 ***
## PQSI_score          -3002.4     1869.9  -1.606  0.11192    
## GenderM:PQSI_score    268.9     2699.7   0.100  0.92088    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49220 on 88 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.4992, Adjusted R-squared:  0.4821 
## F-statistic: 29.24 on 3 and 88 DF,  p-value: 3.29e-13

Unfortunately, we did not support our hypothesis! Gender does not interact with sleep quality. We now have more than one model, so we’re going to change our stargazer to reflect that. I’ve already demonstrated how to export a table, so we won’t review that.

stargazer(m1, m2, 
          type = "text", 
          title= "Predicting Gray Matter Volume",
          column.labels = c("Model 1"),
          dep.var.labels = "Gray Matter Volume", 
          star.cutoffs = c(0.05, 0.01,0.001),
          covariate.labels = c("Sex, Male", "Sleep Quality (PQSI)", "Male * PQSI", "Intercept"),
          notes = "")
## 
## Predicting Gray Matter Volume
## ================================================================
##                                  Dependent variable:            
##                      -------------------------------------------
##                                  Gray Matter Volume             
##                            Model 1                              
##                              (1)                   (2)          
## ----------------------------------------------------------------
## Sex, Male                                     92,884.560***     
##                                                (23,545.200)     
##                                                                 
## Sleep Quality (PQSI)      -2,405.127            -3,002.450      
##                          (1,866.089)           (1,869.873)      
##                                                                 
## Male * PQSI                                      268.905        
##                                                (2,699.696)      
##                                                                 
## Intercept               705,990.800***        668,309.000***    
##                          (16,222.750)          (15,939.680)     
##                                                                 
## ----------------------------------------------------------------
## Observations                  92                    92          
## R2                          0.018                 0.499         
## Adjusted R2                 0.007                 0.482         
## Residual Std. Error  68,153.350 (df = 90)  49,223.510 (df = 88) 
## F Statistic           1.661 (df = 1; 90)  29.239*** (df = 3; 88)
## ================================================================
## Note:                              *p<0.05; **p<0.01; ***p<0.001
## 

3.3 Exporting Visualizations

One last function I want to highlight is how to export these plots. We could right click on a plot and ‘Save Image’ but the quality is usually atrocious. They are pixelated and disproportioned. My preferred method which produces beautiful quality, poster ready figures is to export the images as a pdf using the code below.

  setwd(Path)
  pdf("Plot.pdf", width = 8, height = 4.5)
  plot
## function (x, y, ...) 
## UseMethod("plot")
## <bytecode: 0x000002763040ed48>
## <environment: namespace:base>
  dev.off()
## png 
##   2
Beautiful.

4 Troubleshooting w/ Google

The last thing I want to highlight is that effective coding is not about memorization. If the internet disappeared and my scripts got erased, I’d be at a total loss to rewrite most of my R scripts, and that’s okay. Coders have built robust communities that function as very valuable resources for any questions or issues that you run into. StackOverflow is a great general resource. Believe it or not, Google and Youtube are some others. Most of the problems I run into can be solved by a simple Google search, which often lead me to coding and stat blogs and vlogs I would have never encountered otherwise. It seems one of the absolute best skills you could develop is knowing how to search for solutions to your coding problems, and that comes with time and practice.

Don’t be ashamed, everyone does it.
Don’t be ashamed, everyone does it.

5 Closing

Congrats! You made it through an entire statistics coding bootcamp while potentially not really know any statistics! That’s a herculean feat. Hopefully this was more encouraging than discouraging, but should you find yourself more on the discouraged side, don’t hesitate to shoot me an email (). Happy to help, and thanks for sticking with it.

PICTURED: You, on your way to success
PICTURED: You, on your way to success